npj Digital Medicine — Latest Matching Preprints

1

Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv

Top 0.1%

42.8%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
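
The C-index values reported above can be read as the fraction of comparable eye pairs that a model orders correctly. The following is a minimal sketch of Harrell's concordance index, assuming a higher score means the event (here, visual improvement) is predicted to occur sooner. The function name and the simplified handling of tied times are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs whose predicted order is correct.

    times:       observed follow-up time per subject
    events:      1 if the event was observed, 0 if censored
    risk_scores: higher score = event predicted to happen sooner
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times; a pair is also not comparable
        # when the subject with the shorter time was censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Under this definition a value of 0.50 corresponds to random ordering, which is how the EHR-only result reads.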

2

Interpretable AI for Accelerated Video-Based Surgical Skill Assessment: A Highlights-Reel Approach

Lafouti, M.; Feldman, L. S.; Hooshiar, A.

2026-04-20 medical education 10.64898/2026.04.18.26351193 medRxiv

Top 0.1%

29.7%

Background: Manual video-based evaluation of surgical skills can be time-consuming and can delay trainee feedback. Artificial intelligence (AI) offers opportunities to automate aspects of assessment while maintaining clinician oversight. We developed an interpretable spatiotemporal model that classifies surgical expertise directly from endoscopic video in standardized training tasks and generates saliency-based "highlights reels" showing the most influential frames. Methods: An RGB pipeline combining InceptionV3 for spatial feature extraction and a gated recurrent unit (GRU) for temporal modeling was trained on the JIGSAWS dataset. The model outputs novice, intermediate, or expert labels. A rolling-window, low-latency evaluation at 30 fps with a stride of 10 frames was used. A motion-augmented variant fused RGB with optical-flow features. Spatial and temporal saliency maps highlighted key decision-making regions. Results: The RGB model achieved 95% accuracy (F1: 92% expert, 86% intermediate, 99% novice). Performance was strongest for novice and expert trials, while intermediate trials showed the lowest recall, consistent with greater ambiguity around the intermediate skill level. Saliency maps consistently emphasized tool-tissue interactions and peaked during technically demanding phases. The optical-flow variant underperformed (approximately 38% accuracy), which may reflect sensitivity to global camera motion and other non-informative motion patterns. Conclusions: This interpretable AI pipeline accurately classifies surgical skill while producing intuitive visual highlights. Future work will refine highlight thresholds and validate on laparoscopic inguinal hernia repair for real-world deployment.
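
The rolling-window evaluation described in the Methods can be sketched as a simple index generator. The abstract gives only the frame rate (30 fps) and the stride (10 frames); the window length of 30 frames used here is an assumption for illustration, as is the function name.

```python
def rolling_windows(num_frames, window=30, stride=10):
    """Yield (start, end) frame indices, end exclusive, for each full window.

    window=30 is an assumed length (one second at 30 fps); stride=10 matches
    the stride stated in the abstract.
    """
    for start in range(0, num_frames - window + 1, stride):
        yield start, start + window


# Example: a 2-second clip at 30 fps produces four overlapping windows.
clip_windows = list(rolling_windows(60))
```

With these settings each new prediction needs only 10 fresh frames, which is one plausible reading of "low-latency" in this context.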

3

Individualized Forecasting of Headache Attack Risk Using a Continuously Updating Model

Houle, T. T.; Lebowitz, A.; Chtay, I.; Patel, T.; McGeary, D. D.; Turner, D. P.

2026-04-22 neurology 10.64898/2026.04.20.26350119 medRxiv

Top 0.1%

28.9%

Importance: Migraine attacks often occur unpredictably, limiting the ability of individuals to initiate timely preventive or preemptive treatment. Short-term probabilistic forecasting of migraine risk could enable more targeted management strategies. Objective: To externally validate the previously developed Headache Prediction Model (HAPRED-I), evaluate an updated continuously learning model (HAPRED-II), and assess the feasibility and short-term safety of delivering individualized probabilistic migraine forecasts directly to patients. Design, Setting, and Participants: Prospective 8-week cohort study conducted remotely at two academic medical centers in the United States (Massachusetts General Hospital and Wake Forest Health Sciences) between 2015 and 2019. Adults with recurrent migraine or tension-type headache completed twice-daily electronic diaries. A total of 230 participants contributed 23,335 diary entries across 11,862 participant-days of observation. Main Outcomes and Measures: Occurrence of a headache attack within 24 hours following each evening diary entry. Model performance was evaluated using discrimination (area under the receiver operating characteristic curve [AUC]) and calibration. Results: External validation of HAPRED-I demonstrated modest discrimination (AUC, 0.59; 95% CI, 0.57-0.61) and poor calibration, with predicted probabilities consistently exceeding observed headache risk. In contrast, the continuously updating HAPRED-II model demonstrated progressive improvement in predictive performance as participant-specific data accumulated. Discrimination increased from an AUC of 0.59 (95% CI, 0.57-0.61) during the first 14 days to 0.66 (95% CI, 0.63-0.70) after the first month, accompanied by improved calibration across predicted risk levels. Over the study period, 6,999 individualized forecasts were delivered directly to participants.
No evidence suggested that receipt of forecasts was associated with increasing headache frequency or worsening predicted headache risk trajectories. Conclusions and Relevance: A static migraine forecasting model demonstrated limited transportability to new individuals. In contrast, models that continuously update within individuals may improve predictive accuracy over time and enable real-time delivery of personalized migraine risk forecasts. Further work incorporating richer physiologic and contextual predictors will likely be necessary before such systems can reliably guide clinical treatment decisions.

4

Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv

Top 0.1%

28.8%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses in more than 91% of cases. However, diagnostic competence was not associated with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2%, 95% CI 5.6 to 8.8; Pro: 15.8%, 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5%, 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail.
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.

5

MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv

Top 0.1%

28.4%

MedSafe-Dx (v0) is a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support, built on a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.

6

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv

Top 0.1%

27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

7

Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv

Top 0.1%

26.0%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
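
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Gwet's AC1 for two raters and binary labels. It assumes each MSE classification is coded 1 (pathology present) or 0 (absent); that coding and the function name are assumptions for illustration, and the study's actual rating scheme may differ.

```python
def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters with binary (0/1) ratings.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and
    p_e = 2 * pi * (1 - pi), with pi the mean proportion of positive ratings
    across both raters.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    chance = 2 * pi * (1 - pi)  # at most 0.5, so the denominator is never 0
    return (observed - chance) / (1 - chance)
```

Unlike Cohen's kappa, the chance term here shrinks when one category is rare, so AC1 tends to stay stable under skewed prevalence.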

8

Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv

Top 0.2%

23.4%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
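
The comparison against review-all and review-none rests on the standard net benefit quantity from decision curve analysis. The sketch below uses the usual definition, net benefit = TP/N - (FP/N) * pt / (1 - pt), where pt is the threshold probability. Function names and the toy data are illustrative and do not come from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions at or above `threshold`."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_review_all(y_true, threshold):
    """Net benefit of flagging every patient (the review-all strategy)."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Review-none has a net benefit of 0 by definition.
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.1, 0.8, 0.3]
model_nb = net_benefit(toy_labels, toy_probs, threshold=0.5)
```

A model is worth using at a given threshold only when its net benefit exceeds both reference strategies, which is the comparison a decision curve plots across thresholds.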

9

From Protocol to Practice: Graded Sepsis Bundle Compliance and Actionable Insights from Real-World ICU Data

TRIPATHI, H.; Roy, K.; Rahimi, S.; Neupane, S.; Bozorgzad, S.

2026-04-25 intensive care and critical care medicine 10.64898/2026.04.23.26351412 medRxiv

Top 0.2%

22.9%

Sepsis is a leading cause of in-hospital mortality, yet systematically evaluating temporal adherence to the Surviving Sepsis Campaign (SSC) bundle across large patient populations remains difficult due to semantic variability in electronic health records and the loss of clinical nuance inherent in binary pass/fail compliance judgments. We present an expert-guided neuro-symbolic pipeline that pairs LLM-based semantic normalization with a Sugeno fuzzy inference system encoding eight SSC bundle rules, producing graded per-episode compliance scores whose clinical decision boundaries are set through domain expert consultation. Applied to 2,438 sepsis episodes from MIMIC-IV v3.1, the dual-classifier normalization layer achieves substantial inter-system agreement with high embedding-based confirmation, resolving hundreds of clinically relevant drug strings that purely symbolic systems miss. The graded framework reveals that Hour-1 bundle failures, particularly antibiotic timing, are the dominant driver of low overall compliance, and that higher bundle adherence is associated with notably shorter ICU stays, with antibiotic delays beyond six hours increasing median stays by 61%. These results demonstrate that neuro-symbolic graded assessment can surface actionable compliance patterns that binary evaluation frameworks cannot capture.
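
To make the idea of a graded rather than pass/fail score concrete, here is a minimal zero-order Sugeno sketch for a single bundle element. Everything specific in it is hypothetical: the membership breakpoints (1 h and 3 h), the two rules, and the function names are assumptions for illustration, not the eight expert-defined rules the abstract describes.

```python
def timely_membership(delay_hours):
    """Hypothetical membership: 1.0 up to 1 h, falling linearly to 0.0 at 3 h."""
    if delay_hours <= 1.0:
        return 1.0
    if delay_hours >= 3.0:
        return 0.0
    return (3.0 - delay_hours) / 2.0


def sugeno_score(fired_rules):
    """Zero-order Sugeno output: firing-strength-weighted mean of consequents.

    fired_rules: iterable of (firing_strength, constant_consequent) pairs.
    """
    fired_rules = list(fired_rules)
    total_weight = sum(w for w, _ in fired_rules)
    if total_weight == 0:
        raise ValueError("no rule fired")
    return sum(w * z for w, z in fired_rules) / total_weight


def antibiotic_compliance(delay_hours):
    """Graded compliance in [0, 1] for antibiotic timing (illustrative only)."""
    timely = timely_membership(delay_hours)
    late = 1.0 - timely
    # Rule 1: IF antibiotics are timely THEN compliance = 1.0
    # Rule 2: IF antibiotics are late   THEN compliance = 0.0
    return sugeno_score([(timely, 1.0), (late, 0.0)])
```

Under these assumed breakpoints, a 2-hour delay scores 0.5 instead of being collapsed to a binary failure, which is the kind of nuance a graded framework is meant to keep.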

10

Research Paper on AuditMed: A Single-File, Browser-Based Clinical Evidence Audit Platform Architecture, Current Capabilities, and Proposed Applications in Drug Informatics and Pharmacy Education

Ferguson, D. J.

2026-04-20 health informatics 10.64898/2026.04.19.26351188 medRxiv

Top 0.2%

22.8%

Background: Clinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. Objective: This report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. Methods: AuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search → Select → Parse → Analyze → Infer → Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. Results: The pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance. Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers.
Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. Evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage is in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.

11

Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv

Top 0.2%

22.6%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models assume a uniform and fixed frequency of sampling, thus limiting the generalizability of models, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. 
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.

12

Multimodal Integration of Ambulatory ECG and Clinical Features for Sudden Cardiac Death and Pump Failure Death Prediction

Swee, S.; Adam, I.; Zheng, E. Y.; Ji, E.; Wang, D.; Speier, W.; Hsu, J.; Chang, K.-W.; Shivkumar, K.; Ping, P.

2026-04-22 cardiovascular medicine 10.64898/2026.04.21.26351421 medRxiv

Top 0.2%

22.6%

Ambulatory electrocardiography (ECG) provides continuous monitoring of the heart's electrical activity. However, many existing machine learning and artificial intelligence models for analyzing ambulatory ECG traces are unimodal and do not incorporate patient clinical context. In this study, we propose a multimodal framework integrating ambulatory ECG-derived representations with clinical text embeddings to predict two cardiac outcomes: sudden cardiac death and pump failure death. Ambulatory ECG traces are preprocessed, segmented, and encoded via a multiple instance learning and temporal convolutional neural network framework. In parallel, patient clinical features are parsed into structured prompts, which are passed through a large language model to generate clinical reasoning; this reasoning passes through a biomedical language encoder to generate a text embedding. With the ECG and text embeddings, we systematically evaluate multiple fusion strategies, including concatenation- and gating-based approaches, to integrate these two data modalities. Our results demonstrate that multimodal models consistently outperform unimodal baselines, with adaptive fusion mechanisms providing the greatest improvements in predictive performance. Decision curve analysis highlights the potential clinical utility of the proposed framework for risk stratification. Finally, we visualize model attention across modalities, including ECG attention patterns, segment-level saliency, heart rate variability features, and clinical reasoning, to contextualize patient-specific predictions.

13

The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv

Top 0.2%

22.5%

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinician's oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between ~50 and ~85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.
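
The contrast between a worst-score summary and an averaged index can be sketched as follows. The abstract states only that FDI-S averages PAS scores across tested consistencies on a 0-100 scale; the linear rescaling of the 1-8 PAS range used here, the function names, and the example values are assumptions for illustration, not the published FDI formula.

```python
PAS_MIN, PAS_MAX = 1, 8  # Penetration-Aspiration Scale range


def worst_pas(pas_by_consistency):
    """Worst (highest) PAS score across the consistencies that were tested."""
    return max(pas_by_consistency.values())


def mean_pas_index(pas_by_consistency):
    """Mean PAS across tested consistencies, rescaled linearly to 0-100.

    The rescaling is an assumed mapping, not the authors' definition.
    """
    scores = list(pas_by_consistency.values())
    if not scores:
        raise ValueError("at least one consistency must be tested")
    if any(s < PAS_MIN or s > PAS_MAX for s in scores):
        raise ValueError("PAS scores must lie between 1 and 8")
    mean_score = sum(scores) / len(scores)
    return (mean_score - PAS_MIN) / (PAS_MAX - PAS_MIN) * 100


# Hypothetical patient: aspiration on one consistency, normal on two others.
example = {"thin liquid": 8, "puree": 1, "solid": 1}
```

For this hypothetical patient the worst score is the scale maximum while the averaged index is about 33, which shows how a single bad trial can dominate a worst-score summary as more consistencies are tested.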

14

MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation

Hakata, Y.; Oikawa, M.; Fujisawa, S.

2026-04-22 health informatics 10.64898/2026.04.20.26351338 medRxiv

Top 0.2%

18.8%

Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment. What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references.
How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously. Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions.
The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available. One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
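
The test-time averaging described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the uniform per-axis sampling, the function name `box_tta`, and the toy prediction head `toy_head` are all assumptions made for illustration. The abstract states only that predictions are averaged over samples drawn from within the latent box, with N = 8 in practice.

```python
import random
from typing import Callable, Sequence


def box_tta(
    center: Sequence[float],
    radius: Sequence[float],
    predict: Callable[[Sequence[float]], float],
    n_samples: int = 8,
    seed: int = 0,
) -> float:
    """Average predictions over points sampled inside the box [c - r, c + r]."""
    if len(center) != len(radius):
        raise ValueError("center and radius must have the same length")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Assumption: independent uniform sampling on each axis of the box.
        z = [c + rng.uniform(-r, r) for c, r in zip(center, radius)]
        total += predict(z)
    return total / n_samples


def toy_head(z: Sequence[float]) -> float:
    # Hypothetical stand-in for a prediction head acting on the latent vector.
    return 0.5 * z[0] + 0.25 * z[1]


# A zero radius collapses the box to its center, so the average equals the
# ordinary point prediction; a nonzero radius smooths over the box.
print(box_tta([0.2, 0.4], [0.0, 0.0], toy_head))
print(box_tta([0.2, 0.4], [0.1, 0.1], toy_head))
```

With a fixed seed the estimate is reproducible. Increasing `n_samples` moves the average toward the expectation over the box, which is the limit the abstract refers to.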

15

A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv

Top 0.3%

17.5%

Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods: We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost.
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
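
The three aggregation rules named in this abstract can be sketched in Python as follows. The abstract does not define them, so the readings used here are assumptions: "any-positive" as at least one positive chunk, "two-vote" as at least two, and "majority" as strictly more than half. The function name `aggregate` is hypothetical.

```python
from typing import Sequence


def aggregate(chunk_votes: Sequence[bool], rule: str) -> bool:
    """Combine per-chunk phenotype predictions into one document-level label."""
    positives = sum(chunk_votes)
    if rule == "any-positive":
        # Assumed reading: one positive chunk is enough.
        return positives >= 1
    if rule == "two-vote":
        # Assumed reading: at least two chunks must agree on a positive.
        return positives >= 2
    if rule == "majority":
        # Assumed reading: strictly more than half of the chunks are positive.
        return positives > len(chunk_votes) / 2
    raise ValueError(f"unknown rule: {rule}")


votes = [True, False, False]
for rule in ("any-positive", "two-vote", "majority"):
    print(rule, aggregate(votes, rule))
```

Under these readings the rules differ only in how many positive chunks they require, which is one way to see why aggregation choice can shift the precision-recall balance.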

16

Most Instability Phases Resolve: Empirical Evidence for Trajectory Plasticity in Multimorbidity Care from Longitudinal Relational Monitoring

Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.

2026-04-24 health informatics 10.64898/2026.04.22.26351537 medRxiv

Top 0.3%

17.3%

Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (≥2 consecutive calls with Total_Alerts ≥3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, ±10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, ±14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5.
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence (in duration and in the consistency of high-severity multi-domain flagging across calls) distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal.
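
The window rule stated in the Methods (at least 2 consecutive calls with Total_Alerts of at least 3) can be sketched in Python. This is an illustration under assumptions, not the study's pipeline: it takes one patient's calls in chronological order, treats each maximal qualifying run as one window, and ignores calendar gaps between calls. The function name `instability_windows` is hypothetical.

```python
from typing import List, Sequence, Tuple


def instability_windows(
    total_alerts: Sequence[int],
    alert_threshold: int = 3,
    min_calls: int = 2,
) -> List[Tuple[int, int]]:
    """Return (first_call, last_call) index pairs for each instability window."""
    windows: List[Tuple[int, int]] = []
    run_start = None
    for i, alerts in enumerate(total_alerts):
        if alerts >= alert_threshold:
            if run_start is None:
                run_start = i  # a new run of high-alert calls begins
        else:
            # The run (if any) ended at call i - 1; keep it if long enough.
            if run_start is not None and i - run_start >= min_calls:
                windows.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the final call.
    if run_start is not None and len(total_alerts) - run_start >= min_calls:
        windows.append((run_start, len(total_alerts) - 1))
    return windows


calls = [0, 3, 4, 1, 5, 2, 3, 3, 6]
print(instability_windows(calls))  # [(1, 2), (6, 8)]
```

In the example, the single high-alert call at index 4 is not a window because it is not followed by a second qualifying call.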

17

Uncertainty-Gated Glaucoma Screening: Combining Semi-Supervised Classification with Multi-Agent Large Language Model Deliberation

Garimella Narasimha, S. V.; Brown, N.; Sridhar, S.

2026-04-20 ophthalmology 10.64898/2026.04.17.26351127 medRxiv

Top 0.3%

16.6%

Automated glaucoma screening from optical coherence tomography (OCT) faces two persistent challenges: scarcity of expert-labeled data and unreliable model predictions on diagnostically ambiguous cases. We present a two-tier diagnostic pipeline that addresses both. In the first tier, an EfficientNetV2-S classifier trained under a semi-supervised pseudo supervisor framework achieves 0.84 AUC on 150 held-out test patients from the Harvard Glaucoma Detection and Progression dataset, using only 350 labeled training samples out of 700. In the second tier, 124 flagged cases are routed to a multi-agent system built on MedGemma 4B, where three specialist agents deliberate over three rounds before rendering a final diagnosis. On these flagged cases, the agent system achieves 100% sensitivity (detecting all 55 glaucoma cases with zero missed diagnoses) and 89.5% overall accuracy (111/124), compared to the classifier's 73.4% (91/124). Uncertainty analysis confirms that the classifier's output probability reliably separates confident predictions (96.3% accuracy, n = 27) from uncertain ones (74.0%, n = 123), producing a 22-percentage-point gap that serves as a triage signal. The agents fix 32 cases the classifier misclassifies while introducing 12 new errors, yielding a net improvement of 20 cases. These results are from a single training run without variance estimates and should be interpreted as preliminary evidence that uncertainty-gated routing to vision-language model agents can meaningfully improve diagnostic accuracy on the cases where automated classifiers are least reliable.
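
The gating idea in this abstract can be sketched in Python. The abstract does not report how uncertainty is thresholded, so the confidence measure (distance of the output probability from 0.5), the cutoff of 0.9, and the names `gated_diagnosis` and `second_tier` are all hypothetical choices made for illustration.

```python
from typing import Callable, Tuple


def gated_diagnosis(
    p_glaucoma: float,
    second_tier: Callable[[], bool],
    confidence_cutoff: float = 0.9,  # hypothetical value, not from the abstract
) -> Tuple[bool, str]:
    """Return (diagnosis, source), deferring uncertain cases to a second tier."""
    # Assumed confidence measure: probability of the predicted class.
    confidence = max(p_glaucoma, 1.0 - p_glaucoma)
    if confidence >= confidence_cutoff:
        return p_glaucoma >= 0.5, "classifier"
    # Uncertain case: hand over to the slower second-tier reviewer.
    return second_tier(), "agents"


def always_negative() -> bool:
    # Hypothetical stand-in for the multi-agent deliberation step.
    return False


print(gated_diagnosis(0.97, always_negative))
print(gated_diagnosis(0.55, always_negative))
```

Passing the second tier as a callable means it runs only for flagged cases, which matters when that tier is far more expensive than the classifier.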

18

CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics 10.64898/2026.04.22.26351461 medRxiv

Top 0.3%

14.6%

Objective: To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods: We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results: We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion: CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion: CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.

19

The Golden Opportunity or the Cutting Room Floor? Quantifying and Characterizing the Loss and Addition of Social Determinants of Health during Clinician Editing of Ambient AI Documentation

Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.

2026-04-22 health systems and quality improvement 10.64898/2026.04.20.26351322 medRxiv

Top 0.4%

12.6%

Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% were retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
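
One way to picture the retention, deletion, and addition counts is as a multiset comparison between the mentions extracted from a draft and from its final note. The Python sketch below assumes mentions are matched by an exact category key. The study's actual matching rules are not described in the abstract, and the function name `compare_mentions` and the example keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def compare_mentions(draft: Iterable[str], final: Iterable[str]) -> Dict[str, Counter]:
    """Split mentions into retained, deleted, and added multisets."""
    d, f = Counter(draft), Counter(final)
    return {
        "retained": d & f,  # present in both (minimum count per key)
        "deleted": d - f,   # in the AI draft but not in the final note
        "added": f - d,     # in the final note but not in the AI draft
    }


draft = ["insurance", "marital_status", "substance_use", "substance_use"]
final = ["substance_use", "substance_use", "physical_activity"]
result = compare_mentions(draft, final)
for label, mentions in result.items():
    print(label, sum(mentions.values()), dict(mentions))
```

Because retained plus deleted always equals the draft total, deletion and retention percentages computed this way sum to 100%, which matches how the abstract reports them.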

20

Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv

Top 0.4%

12.6%

Objective: The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods: We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results: We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped because they lacked a drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, most of which resembled drug-indication pairs.
Conclusion: Our methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be readily used by researchers for large-scale analyses.